##Data preparation
This shows the code that was run to prepare the data for the Machine Learning workshop at the OpenSeaLab Hackathon 2019. Note that for some steps you need to download the data manually.
However, you can do this also automatically by using webservices. We will update this document and include these automatic steps.
We are using the ‘Dutch national shellfish monitoring in the coastal zone’: http://www.emodnet-biology.eu/data-catalog?module=dataset&dasid=6204
See selection here, click ‘download’ http://www.emodnet-biology.eu/toolbox/en/download/selection/15d69183543357
library(sf)
library(mapview)
library(raster)
# data downloaded in /data/20190830_162455_15d6931b78ad5a.zip
unzip("data/20190830_162455_15d6931b78ad5a.zip",
exdir = "data")
# read csv
Dutch_shell_full <- read.csv("data/DATA_63zArr.csv")
# convert to sf object
Dutch_shell_sf <- st_as_sf(Dutch_shell_full,
coords = c('decimallongitude','decimallatitude'),
crs = 4326,
remove = FALSE)
# convert date-time column to POSIXct class
Dutch_shell_sf$datecollected <- as.POSIXct(Dutch_shell_sf$datecollected,
format="%Y-%m-%dT%H:%M")
# viz first 50 points on the map
mapview(Dutch_shell_sf[1:50,],
zcol = "yearcollected")
We will select different data other parameters for our measurements, therefore we need a bounding box of the measurements:
# determine min max coordinate values
# xmin <- st_bbox(Dutch_shell_sf)$xmin
# ymin <- st_bbox(Dutch_shell_sf)$ymin
# xmax <- st_bbox(Dutch_shell_sf)$xmax
# ymax <- st_bbox(Dutch_shell_sf)$ymax
bbox <- st_bbox(Dutch_shell_sf)
We want to retrieve the ship vessel density data (especially Fishing vessels) from EMODnet Human Activities. Have a look to the EMODnet Human Activities WCS service (raster layers)
https://ows.emodnet-humanactivities.eu/wcs?service=wcs&version=1.0.0&request=GetCapabilities
This returns a xml with all coverages, for example:
<wcs:ContentMetadata>
<wcs:CoverageOfferingBrief>
<wcs:description>Generated from GeoTIFF</wcs:description>
<wcs:name>emodnet:2017_01_st_00</wcs:name>
<wcs:label>2017_01_st_00</wcs:label>
...
</Layer>
In the example above we see the desciption of the coverage of
See the table below for all options
| Year_month | ship type |
|---|---|
| 2017_01 | st_00: Other |
| 2017_02 | st_01: Fishing |
| 2017_03 | st_02: Service |
| 2017_04 | st_03: Dredging or underwater ops |
| 2017_05 | st_04: Sailing |
| 2017_06 | st_05: Pleasure Craft |
| 2017_07 | st_06: High speed craft |
| 2017_08 | st_07: Tug and towing |
| 2017_09 | st_08: Passenger |
| 2017_10 | st_09: Cargo |
| 2017_11 | st_10: Tanker |
| 2017_12 | st_11: Military and Law Enforcement |
| … | st_12: Unknown |
| Year average: 2017_ [st_..]_avg | st_All: All ship types |
some examples of layers:
| Description | wcs coverage name |
|---|---|
| average of 2017 Sailing density: | emodnet:2017_st_04_avg |
| average of October Dredging: | emodnet:2017_10_st_03 |
| average of 2017 all types: | emodnet:2017_st_00_avg |
Let’s extract the 2017 average of Fishing:
coveragename <- 'emodnet:2017_st_01_avg'
url <- 'https://ows.emodnet-humanactivities.eu/wcs?service=wcs&version=1.0.0&request=getcoverage'
full_url <- paste(url,
'&coverage=', coveragename,
'&crs=EPSG:4326&BBOX=', paste(bbox, collapse = ","),
'&format=image/tiff',
'&interpolation=nearest&resx=0.00833333&resy=0.00833333',
sep = '')
fishing <- raster(full_url)
mapview(fishing)
Have a look to the EMODnet Bathymetry WCS service (raster layers)
https://ows.emodnet-humanactivities.eu/wcs?service=wcs&version=1.0.0&request=GetCapabilities
This returns a xml with all coverages, for example:
<wcs:ContentMetadata>
<wcs:CoverageOfferingBrief>
<wcs:description>Generated from GeoTIFF</wcs:description>
<wcs:name>emodnet:2017_01_st_00</wcs:name>
<wcs:label>2017_01_st_00</wcs:label>
...
</Layer>
Let’s extract the mean bathymetry:
coveragename <- 'emodnet:mean'
url <- 'https://ows.emodnet-bathymetry.eu/wcs?service=wcs&version=1.0.0&request=getcoverage'
full_url <- paste(url,
'&coverage=', coveragename,
'&crs=EPSG:4326&BBOX=', paste(bbox, collapse = ","),
'&format=image/tiff',
'&interpolation=nearest&resx=0.00833333&resy=0.00833333',
sep = '')
bathy <- raster(full_url)
mapview(bathy)
Extract rasterdata at locations:
Dutch_shell_sf$fishing <- extract(fishing, Dutch_shell_sf)
Dutch_shell_sf$bathy <- extract(bathy, Dutch_shell_sf)
Download substrate data from: https://www.emodnet-seabedhabitats.eu/access-data/download-data/ We downloaded the ‘EUSeaMap → Classified habitat descriptors → Substrate type’
library(sf)
library(raster)
library(ggplot2)
library(mapview)
library(fasterize) #https://github.com/ecohealthalliance/fasterize
# unzip file:
# unzip('data/eusm2019_substrate.zip',
# exdir = 'data')
# try with substrate
substrate <- st_read(dsn = 'data/EUSM2019_Substrate/EUSM2019_substrate.gdb',
layer = 'EUSM2019_Arctic_Atlantic_substrate')
## Reading layer `EUSM2019_Arctic_Atlantic_substrate' from data source `C:\Users\lennerts\LOCAL_FILES\OSL\MachineLearning_DataVIz_Workshop\data\EUSM2019_Substrate\EUSM2019_substrate.gdb' using driver `OpenFileGDB'
## Simple feature collection with 496869 features and 3 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -3446029 ymin: 5914810 xmax: 4786738 ymax: 16215280
## epsg (SRID): 3857
## proj4string: +proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs
st_crs(substrate)
## Coordinate Reference System:
## EPSG: 3857
## proj4string: "+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs"
# subset data by North Sea:
# bounding box North Sea
# xmin, ymin, xmax, ymax
# -4.45, 51.00, 12.01, 61.02
NS <- st_polygon(list(rbind(c(-4.45,51.00),
c(12.01,51.00),
c(12.01,61.02),
c(-4.45,61.02),
c(-4.45,51.00))))
NS <- st_sfc(NS) %>% st_set_crs(NA) %>% st_set_crs(4326)
# transform coordinates to match habitats CRS
NS <- NS %>% st_transform(3857)
# cut (intersect) substrate by North Sea bbox
NS_substrate <- st_intersection(substrate, NS)
# the Seabed substrate map originates from a very detailed vector layer
# so convert polygons to raster on at aggregated resolution
# raster template
r_template <- raster(ncols = 791, nrows = 481,
crs = projection(NS_substrate),
ext = extent(NS_substrate),
vals = NULL)
# apparently not all polygon type (error with fasterize)
NS_substrate <- st_cast(NS_substrate, "POLYGON")
# Fasterize for amazingly fast sf to raster! see https://github.com/ecohealthalliance/fasterize
r_NS_substrate_f <- fasterize(NS_substrate, r_template, field = "Substrate")
# # convert to dataframe for ggplot
r_NS_Substrate_spdf <- as(r_NS_substrate_f, 'SpatialPixelsDataFrame')
r_NS_Substrate_df <- as.data.frame(r_NS_Substrate_spdf)
colnames(r_NS_Substrate_df) <- c('Substrate', 'lon', 'lat')
#
# plot(r_NS_Substrate, axes = TRUE)
#
# # Plot data in ggplot
ggplot() +
geom_tile(data=r_NS_Substrate_df, aes(x=lon, y=lat, fill=factor(Substrate))) +
scale_fill_viridis_d(breaks = as.character(1:length(levels(NS_substrate$Substrate))),
labels = levels(NS_substrate$Substrate),
name = 'Substrate')
Extract substrate data from rasters:
Dutch_shell_sf$substrate <- extract(r_NS_substrate_f , Dutch_shell_sf)
# add levels to substrate:
Dutch_shell_sf$substrate <- factor(Dutch_shell_sf$substrate,
levels = 1:length(levels(NS_substrate$Substrate)),
labels = levels(NS_substrate$Substrate))
We use other descriptors from the sdmpredictors package:
sdmpredictors contains several datasets, each containing several layers:
library(sdmpredictors)
# list datasets of sdmpredictors package
datatable(list_datasets())
# list layers of Bio-ORACLE dataset
datatable(list_layers(datasets="Bio-ORACLE"))
load all interesting layers:
# load some Bio-Oracle layers:
bo_layers <- load_layers(layercodes = c("BO_phosphate",
"BO_nitrate",
"BO_salinity",
"BO_sstmax",
"BO_sstmean",
"BO_dissox",
"BO2_curvelmax_bdmax", # Current velocity (maximum at max depth)
"BO2_lightbotmax_bdmax"), # Light at bottom (maximum at max depth)
rasterstack = TRUE )
Dutch_shell_sf <- cbind(Dutch_shell_sf,
extract(bo_layers, Dutch_shell_sf))
Get coastline from rnaturalearth package:
library(rnaturalearth)
BENLGER <- ne_countries(scale = 'large',
country = c('Belgium', 'Netherlands', 'Germany'),
type = 'countries')
BENLGER <- st_as_sf(BENLGER) %>% st_combine()
mapview(BENLGER)
Calculate distance to the coastline:
# Calculate distance to the coastline:
coastdist <- st_distance(BENLGER, Dutch_shell_sf)
# Combine to dataframe:
Dutch_shell_sf$coastdist <- as.vector(coastdist)
Check if it looks ok (for first 50 records):
mapview(Dutch_shell_sf[1:50,],
zcol = "coastdist")
Select only a selection of the parameters:
library(dplyr)
Dutch_shell_sel <- Dutch_shell_sf %>%
select(datecollected, decimallongitude, decimallatitude,
aphiaid, scientificnameaccepted,
kingdom, phylum, class,
order, family, genus,
subgenus, specificepithet, infraspecificepithet,
Bedabundance = BedAbund....m.2.,
BedWetWtBiom = BedWetWtBiom..g.m.2.,
fishing, bathy, substrate,
BO_phosphate, BO_nitrate, BO_salinity,
BO_sstmax, BO_sstmean,
BO_dissox, BO2_curvelmax_bdmax,
BO2_lightbotmax_bdmax, geometry, coastdist)
write.csv(Dutch_shell_sel,
file = "data/Dutch_shell_variables.csv",
row.names = FALSE,
quote = grep('geometry', colnames(Dutch_shell_sel))
)
write.table(Dutch_shell_sel,
file = "data/Dutch_shell_variables_semicolon.csv",
row.names = FALSE,
sep = ";")
So in the end we have a dataset with following variables: